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Abstract 

As an increasing number of genome-wide association studies reveal 
the limitations of attempting to explain phenotypic heritability by single 
genetic loci, there is growing interest for associating complex phenotypes 
with sets of genetic loci. While several methods for multi-locus mapping 
have been proposed, it is often unclear how to relate the detected loci 
to the growing knowledge about gene pathways and networks. The few 
methods that take biological pathways or networks into account are either 
restricted to investigating a limited number of predetermined sets of loci, 
or do not scale to genome-wide settings. 

We present SConES, a new efficient method to discover sets of genetic 
loci that are maximally associated with a phenotype, while being con- 
nected in an underlying network. Our approach is based on a minimum 
cut reformulation of the problem of selecting features under sparsity and 
connectivity constraints, which can be solved exactly and rapidly. 

SConES outperforms state-of-the-art competitors in terms of runtime, 
scales to hundreds of thousands of genetic loci, and exhibits higher power 
in detecting causal SNPs in simulation studies than existing methods. 
On flowering time phenotypes and genotypes from Arabidopsis thaliana, 
SConES detects loci that enable accurate phenotype prediction and that 
are supported by the literature. 

1 Introduction 

Twin and family /pedigree studies make it possible to estimate the heritability 
of observed traits, that is to say the amount of their variability that can be at- 

* chloe-agathe . azencottatuebingen . mpg . de 

t Machine Learning and Computational Biology Research Group, Max Planck Institute for 
Developmental Biology & Max Planck Institute for Intelligent Systems Spcmannstr. 38, 72076 
Tubingen, Germany 

■f The Institute of Scientific and Industrial Research (ISIR) Osaka University 8-1 Mihogaoka, 
Ibaraki-shi, Osaka 567-0047 Japan 

§Zentrum fur Bioinformatik, Eberhard Karls Universitat Tubingen, 72076 Tubingen, Ger- 
many 



tributed to genetic differences. In the past few years, genome- wide association 
studies (GWAS), in which several hundreds of thousands to millions of single 
nucleotide polymorphisms (SNPs) are assayed in up to thousands of individu- 
als, have made it possible to identify hundreds of genetic variants associated 
with complex phenotypes (Zuk et al. 2012). Unfortunately, while studies as- 



sociating single SNPs with phcnotypic outcomes have become standard, they 



often fail to explain much of the heritability of complex traits (Manolio et al. 



20091. Investigating the joint effects of multiple loci by mapping sets of genetic 



variants to the phenotype has the potential to help explain part of this missing 



heritability (Marchini et al. 2005) 



While efficient multiple linear regression approaches ( Cho et al. 2010 Wang 



et al. 2011 Rakitsch et al. 20121 make the detection of such multivariate as- 



sociations possible, they often remain limited in power and hard to interpret. 
Incorporating biological knowledge into these approaches could help boosting 
their power and interpretability. However, current methods are limited to pre- 



defining a reasonable number of candidate sets to investigate (Cantor et al. 



2010 Fridley and Biernacka 2011 Wu et al. 2011), for instance by relying on 



gene pathways. They consequently run the risk of missing biologically relevant 
loci that have not been included in the candidate sets. This risk is made even 
likelier by the incomplete state of our current biological knowledge. For this 
reason, our goal here is to use prior knowledge in a more flexible way. We pro- 
pose to use a biological network, defined between SNPs, to guide a multi-locus 
mapping approach that is both efficient to compute and biologically meaning- 
ful: We aim to find a small set of SNPs that (a) are maximally associated with 
a given phenotype and (b) tend to be connected in a given biological network. 
In addition, this set must be computed efficiently on genome-wide data. In this 
paper we assume an additive model to characterize multi-locus association. The 
network constraint stems from the assumption that SNPs influencing the same 
phenotype are biologically linked. However, the diversity of the type of rela- 
tionships that this can encompass, together with the current incompleteness 
of biological knowledge, makes providing a network in which all the relevant 
connections are present unlikely. For this reason, while we want to encourage 
the SNPs to form a subnetwork of the network, we also do not want to enforce 
that they must form a single connected component. Finally, we stress that the 
method must scale to networks of hundreds of thousands or millions of nodes. 
Approaches developed to analyze gene networks containing hundreds of nodes, 



such as that of Nacu et al. (2007), Chuang et al. (2007) or Li and Li (2008), do 



therefore not apply. 

While our method can be applied to any network between genetic markers, 
we explore three special types of networks here (see Figure [lj : 

• Genomic sequence network (GS): SNPs adjacent on the genomic sequence 
are linked together. In this setting we aim at recovering sub-sequences of 
the genomic sequence that correlate with the phenotype. 

• Gene membership network (GM): SNPs are connected as in the sequence 
network described above; in addition, SNPs near the same gene are linked 
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together as well. Usually, a SNP is considered to belong to a gene if it 
is either located inside said gene ore within a pre-defined distance of this 
gene. In this setting we aim more particularly at recovering genes that 
correlate with the phenotype. 

• Gene interaction network (GI): SNPs are connected as in the gene mem- 
bership network described above. In addition, supposing we have a gene- 
gene interaction network (derived, from example, from protein-protein 
interaction data or gene expression correlations), SNPs belonging to two 
genes connected in the gene network arc linked together. In this setting, 
we aim at recovering potential pathways that explain the phenotype. 

Our task is a feature selection problem in a graph-structured feature space, 
where the features are the SNPs and the selection criterion should be related to 
their association with the phenotype considered. Note that our problem is dif- 
ferent from subgraph selection problems such as those encountered in chemoin- 
formatics, where each object is a graph and each feature is a subgraph of its 



own (Tsuda 2011) 



Several approaches have already been developed for selecting graph-structured 



features. A number of them (Le Saux and Bunke 2005 Jie et al. 2012) only use 



the graph over the features to build the learners evaluating their relevance, but 
do not enforce that the selected features should follow this underlying structure. 
Indeed they can be applied to settings where the features connectivity varies 
across examples, while here all individuals share the same network. 



The overlapping group Lasso {Jacob et al. 2009 Liu et al. 2012 ) is a sparse 



linear model designed to select features that belong to the union of a small 
number of predefined groups. If a graph over the features is given, defining those 
groups as all pairs of features connected by an edge or as all linear subgraphs of a 



given size yields the so-called graph Lasso. A similar approach is taken by Huang 



et al. (2009): their structured sparsity penalty encourages selecting a small 



number of base blocks, where blocks are sets of features defined so as to match 
the structure of the problem. In the case of a graph-induced structure, blocks are 
defined as small connected components of that graph. As shown in |Mairal and| 
Yu. (2011 1, the overlapping group Lasso mentioned above is a relaxation of this 



binary problem. As the number of linear subgraphs or connected components of 
a given size grows exponentially with the number of nodes of the graph, which 
can reach millions in the case of whole genome SNP data, only the edge-based 
version of the graph Lasso can be applied to our problem. It is however unclear 
whether it is sufficient to capture long-range connections between graph nodes. 



Li and Li (2008) propose a network-constrained version of the Lasso that 



imposes the type of graph connectivity we deem desirable. However, their ap- 
proach has been developed with networks of genes (rather than of SNPs) in mind 
and does not scale easily to the data sets we envision. Indeed, the implementa- 
tion they propose relies on a singular value decomposition of the Laplacian of 
the network, which is intensive to compute and cannot be stored in memory. 



Chuang et al. ( 2007 ) also searched subnetworks of protein-protein interaction 



networks that are maximally associated with a phenotype; however, their greedy 
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(a) Genomic sequence network: SNPs adjacent 
on the genomic sequence are connected to each 
other. 
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(b) Gene membership network: In addition, 
SNPs near the same gene (i.e., within a speci- 
fied distance of that gene) are connected. 



genel 




(c) Gene-interaction network: In addition, SNPs 
near two interacting genes are connected. 

Figure 1: Small examples of the three types of networks considered. 
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approach requires to fix beforehand a (necessarily small) upper-limit on the size 
of the subnetworks considered. 

In the case of directed acyclic graphs, Mairal and Yu. (2011 1 propose a min- 
imum flow formulation that make it possible to use for groups (or blocks) the 
set of all paths of the network. Unfortunately, the generalization to undirected 
graphs with cycles, such as the SNP networks we consider, requires to randomly 
assign directions to edges and prune those in cycles without any biological jus- 



tification. Although this can work reasonably well in practice ( Mairal and Yu. 



2011 1, this is akin to artificially removing more than half of the network con- 



nections without any biological justification. 

In what follows, we formulate the network-guided SNP selection problem as 
a minimum cut problem on a graph derived from the SNP network in Section [2] 
and evaluate the performance of our solution both in simulations and on actual 
Arabidopsis thaliana data in Section [3j 



2 Methods 



2.1 Problem Formulation 

Let n be the number of SNPs and m the number of individuals. The SNP-SNP 
network is described by its adjacency matrix W of size n x n. 



et al. 



A number of statistics based on covariance matrices, such as HSIC (Gretton 

2011), can be used to compte a measure of 



2005| ) or SKAT flWu et al. 
dependence c € R n between each single SNP and the phenotype. Under the 
common assumption that the joint effect of several SNPs is additive (which cor- 
responds to using linear kernels in those methods) , c is such that the association 
between a group of SNPs and the phenotype can be quantified as the sum of 
the scores of the SNPs belonging to this group. In other words, given an indi- 
cator vector / € {0, 1}™ such that, for any p £ {1, • • • , n}, f p is set to 1 if the 
p-th SNP is selected and otherwise, the score of the selected SNPs is given by 

Q(f) = E n P =ic P f P = c T f- 

We want to find the indicator vector / £ {0, 1}™ that maximizes the score 
Q(f) while ensuring that the solution is made of connected components of the 
SNP network. However, in general, it is difficult to find a subset of SNPs 
that satisfies the above two properties. In fact, given a positive integer k, the 
problem of finding a connected subgraph with k vertices that maximize the 
sum of the weights on the vertices, which is equivalent to Q(f) of our case, is 
known to be a strongly NP-complete problem (Lee and Dooly 1996 ). Therefore, 



this problem is often addressed based on enumeration-based algorithms, whose 
runtime grows exponentially with k. To cope with this problem, we consider an 
approach based on a graph-regularization scheme, which allows us to drastically 
reduce the runtime. 
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2.2 Feature Selection with Graph Regularization 



Rather than searching through all subgraphs of a given network, we reward 
the selection of adjacent features through graph regularization. As it is also 
desirable for biological interpretation and to avoid selecting large number of 
SNPs in linkage disequilibrium, that the selected sub-networks are small in size, 
we reward sparse solutions. The first requirement can be addressed by means of 



a smoothness regularizer on the network (Smola and Kondor 2003 Ando and 



Zhang 2007), while the second one can be enforced with an la constraint: 



argmax tT/ - A f T ^Lf ~V_^f\\o W 

association connectivity sparsity 

where L is the Laplacian of the SNP network. L is defined as L — D—W, where 
D is the diagonal matrix where D PtP is the degree of node p. Note that here, 
we directly minimize the number of non-zero entries in / and do not require 
the proxy of an l\ constraint to achieve sparsity (of course in the case of binary 
indicators, l\ and Iq norms are equivalent). 

A and r\ are positive parameters that control the importance of the connect- 
edness of selected features and the sparsity regularizer, respectively. 

Since W Ptq = 1 if q is a neighbor of p (also written as p ~ q), and otherwise, 
if we denote by J\f(p) the neighborhood of p, then the degree of p can be rewritten 
Dp.p = 2geA/"(p) 1- The secon( l term in Eq. |IJcan therefore be rewritten as 

/ T Z/ = ]T(/ p -/ 9 ) 2 , (2) 
and the problem in Eq. ([I]) is equivalent to 

n 

argmax V/p(c p - 77) - A V(/ p - / 9 ) 2 . (3) 
/e{o,i}" p=1 p „ q 

As (f p — fq) 2 is 1 if f p 7^ f q and otherwise, it can be seen that the second 
term penalizes both selecting SNPs that are not connected to one another and 
selecting only subsets of connected components in the SNP network. Also, as 
II/Hq = 1„/ in our case, the third term is equivalent to reducing the individual 
SNP scores c by a constant 77 > 0. 



2.3 Min-Cut Solution 

Given a graph with vertices V := {1, . . . , n} and adjacency matrix W £ R nxra , 
a cut C(S, V\ S) (S C {^\0}) is defined as a partition of the graph. The corre- 
sponding cut function C : {0, 1}" — > R is defined as C(f ) = J2 P q =i W Ptq f p (l — 
f q ) where f p is 1 if p € £ and otherwise. Also, a s/i-cut C(S, V\ S) is defined 
as a cut such that s £ S and t £ V\S, where s and t in V are respectively called 
the source and the sink of the network. Given a cut C(S, V\S), a set of all pairs 
(u, v) for u £ S and v £ V \ S with positive weight Wij is called the cut-set of 
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(if cp > n) 



original graph 



Figure 2: Graph for the s/i-min-cut formulation of the selection of networks of 
genetic markers. 



cut C(S,V\S). Then, for a fixed s, t G V, the problem of finding a s/t-cut that 
gives a maximum sum of weights on its cut-set is called the s/t min-cut prob- 
lem. It is well known from the max-flow-min-cut theorem ( Papadimitriou and 



Steiglitz 1982 1 that the s/t min-cut problem can be solved efficiently using the 



maximum-flow algorithm (Goldberg and Tarjan 1988). In our implementation 



we use the Boykov-Kolmogorov algorithm (Boykov and Kolmogorov 2004). Al- 
though its worst case complexity is in 0(n 2 riEnc), where ue is the number of 
edges of the graph and nc the size of the minimum cut, it performs much better 
in practice. 

Proposition 1 Given a graph Q with adjacency matrix W , the graph-regularized 
maximization of score Q{*), i.e., problem ([!]), is equivalent to a s/t -min-cut for 
a graph with adjacency matrix M and two additional nodes s and t, where 
Mp.q = ^W Pl q for 1 < p, q < n and where the weights of the edges adjacent to 
nodes s and t are defined as 



M x 



_ Jc p - rj ifc p > r\ 
otherwise 



id M t>p = 



T] — Cp if c p < n 
otherwise 



(p=l,...,n). 



Proof 1 The problem in Eq. ([T]) is equivalent to 

argmin (rjt n - c) T f + Xf T Lf . 

/e{o,i}» 



(4) 



FromEq. Q . we can see that the second term in the above equation is equivalent 
to a cut function: a pair of nodes (p, q) increases the energy if and only if the 
nodes are included in different sides of the graph. This is clear because this term 
can be transformed from the definition of L as 



f T Lf 




9=1 / P>9=1 
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Moreover, as for the linear term in ([I]), we can see that, if we include the 
pth-node into S (i.e., f p — 1), the objective increases by r\ — c p if rj > c p or 
decreases by c p — r\ if c p > i]. Thus, if we define a vector f := [/ T l 0] T , we can 
still represent the objective in Eq. ^ as a cut function on graph with adjacency 
matrix M whose entries are defined for 1 < p, q < n by M pq — XW pq , 

M n+1 P = ( Cp ~ 11 > V and M p n+2 = ^ ~ C ? < V 
+ l ' p \ otherwise p ' + \ otherwise . 

Since now the (n + l)-th and (n + 2)-th nodes, which we refer to as s and t, do 
not correspond to nodes in the original graph, we can see that this is equivalent 
to the s/t min-cut problem stated in Proposition^ M 

It is therefore possible to use graph cuts to efficiently optimize the objective 
function defined in Equation [l] and select a small number of connected SNPs 
maximally associated with a phenotype. We refer to this method as SConES, 
for Selecting CONnected Explanatory SNPs. 



3 Results 

We evaluate the ability of SConES to detect networks of trait-associated SNPs 
on simulated datasets and on datasets from an association mapping study in 
Arabidopsis thaliana. 

3.1 Experimental Settings 

For all of our experiments, we consider the three SNP networks defined in Sec- 
tion [T] the genomic sequence network, the gene membership network, and the 
gene interaction network. For SConES, the association term c is derived from 



Linear SKAT ( Wu et al. . 2011 ), which makes it possible to correct for covariates 



(and therefore population structure). 

Linear regression As a baseline for comparisons, we run a linear-regression- 
based single-SNP search for association, and select those SNPs that are signifi- 
cantly associated with the phenotype (Bonferroni-corrected p- value < 0.05). 

Lasso To compare SConES to a method that also considers all additive effects 
of SNPs simultaneously with a sparsity constraint, but without any network 
regularization, we also run a Lasso regression ( |Tibshirani| 1 1 994| , using the SLEP 



implementation (Liu et al. 2009) of the Lasso 



ncLasso In addition, we compare SConES to the network-constrained Lasso 



ncLasso (Li and Li 20081, a version of the Lasso with sparsity and graph- 
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smoothing constraints equivalent to that of SConES. ncLasso solves the follow- 
ing relaxed problem (f £ K™): 



argmin ~ \ \Gf -r\\l + Xf T Lf + 77II/II1 



(5) 



We use the reformulation proposed by the authors together with the SLEP 
implementation Liu et al. (2009) of the Lasso. Note that this reformulation 



requires to compute and store a single value decomposition of L and is therefore 
not applicable in genome-wide settings where the size of L exceeds 100 000 x 
100 000 by far. 



groupLasso and graphLasso Eventually, we also compare our method to 
the non-overlapping group Lasso ( Jacob et al. 2009 ) . The non-overlapping 



group Lasso solves the following relaxed problem: 



1 



arg mm 



\Gf 



AEII/ e l 



(6) 



where Q is a set of (possibly overlapping) predefined groups of SNPs. We con- 
sider two versions of the non-overlapping group Lasso: 

• graphLasso, for which the groups are directly defined from the same net- 
works as considered for SConES as all pairs of vertices connected by an 
edge; 

• groupLasso, for which the groups are defined sensibly as follows: 

— Genomic sequence groups (GS): pairs of adjacent SNPs (note this 
gives raise to the same groups as for graphLasso with the sequence 
network) ; 

— Gene membership groups (GM): SNPs near the same gene; 

— Gene interaction groups (GI): SNPs near either member of two in- 
teracting genes. Here SNPs near genes that are not in the interaction 
network get grouped by gene. 



We use the SLEP implementation ( Liu et al. 2009 1 of the non-overlapping 



group Lasso, combined with the trick described by Jacob et al. ( 2009 ) to com 



pute the overlapping group Lasso by replicating features in non-overlapping 
groups. 



Setting the parameters All methods considered, except for the linear re- 
gression, have parameters (e.g. A and 77 in the case of SConES) that need to 
be optimized. In our experiments, we run 10-fold cross-validation grid-search 
experiments over ranges of values of the parameters: 7 values of A and rj each 
for SConES and ncLasso, and 7 values of the parameter A for the Lasso and 
the non-overlapping group Lasso (ranging from 10 -3 to 10 3 ). We then pick as 
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optimal the parameters leading to the most stable selection, defined as that for 
which the number of SNPs selected in all of the 10 folds is the largest. To avoid 
trivial solutions picking all or most of the SNPs, we discard results from pairs 
of parameters that lead to selection of more than 10% of the features. 

3.2 Runtime 

We first compare the CPU runtime of SConES with that of the linear regression, 
ncLasso and graphLasso. To assess the performance of our methods, we simulate 
from 100 to 200 000 SNPs for 200 individuals and generate exponential random 
networks with a density of 2% between those SNPs. 

We report the real CPU runtime of one cross-validation, for set parameters, 
over a single AMD Opteron CPU (2048KB, 2600MHz) with 512GB of memory, 
running Ubuntu 12.04 (Figure [3]). Across a wide range of numbers of SNPs, 
SConES is at least two orders of magnitude faster than graphLasso and ncLasso. 




100 500 1000 2500 5000 7500 10000 

#SNPs 




#SNPs 

Figure 3: Real CPU runtime comparison between linear regression, ncLasso, 
non-overlapping group Lasso and SConES, from 100 to 10 000 SNPs (left) 
and from 100 to 200 000 SNPs (right). After three weeks, ncLasso and non- 
overlapping group Lasso had not finished running for 25 000 SNPs. 
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3.3 Simulations 



To assess the performance of our methods, we simulate phenotypes for m = 500 
real Arabidopsis thaliana genotypes (214 051 SNPs), chosen at random among 



those made available by Horton et al. (2012), and the A. thaliana protein- 



protein interaction information from TAIR (The Arabidopsis Information Re- 



source 



2012) (resulting in 55 584 646 SNP-SNP connections). We use a window 



size of 20 000 base-pairs to define proximity of a SNP to a gene, in accordance 
with the threshold used for the interpretation of GWAS results in |Atwell et al.\ 
(2010). Restricting ourselves to 1,000 randomly picked SNPs with minor allele 
frequency larger than 10%, we pick 20 of the SNPs to be causal, and generate 
phenotypes yi = w T gi + e, where both the support weights w and the noise e 
are normally distributed. We consider the following scenarios: (a) the causal 
SNPs are adjacent in the genomic sequence; (b) the causal SNPs are near the 
same gene; (c) the causal SNPs are near any of five genes connected in a gene- 
gene-interaction-network. We then select SNPs using linear regression, Lasso, 
ncLasso, the two flavors of non-overlapping group Lasso, and SConES as de- 



scribed in Section 3.1 We repeat each experiment 50 times, and compare the 
selected SNPs of either approach with the true causal ones in terms of power 
(fraction of causal SNPs selected) and false discovery rate (FDR, fraction of 
selected SNPs that are not causal). 

As SConES returns a binary feature selection rather than a feature ranking, 
it is not possible to draw FDR curves or compare powers at same FDR as is 
often done when evaluating such methods. Figure [4] presents the average FDR 
and power of the different algorithms under the three scenarios, depending on 
the network used. The closer the FDR-power point representing an algorithm 
to the upper-left corner, the better this algorithm at maximizing power while 
mininmizing FDR. As it is easy to get better recall by selecting more SNPs, we 
also report on the same figure the number of SNPs selected by each algorithm, 
and show that it remains reasonably close to the true value of 20 causal SNPs. 
Eventually, we summarize those results with F-scores (harmonic mean of power 
and one minus FDR) in Table [TJ 

SConES is systematically better than its state-of-the-art comparison part- 
ners at leveraging structural information to retrieve the connected SNPs that 
were causal. While the performance of SConES and ncLasso does depend on 
the network, the non-overlapping group Lasso is much more sensitive to the 
definition of its groups. Finally, ncLasso is both slower and less performant 
than SConES. This indicates that solving the feature selection problem we pose 
directly, rather than its relaxed version, allows for better recovery of true causal 
features. 



3.4 Arabidopsis Flowering Time Phenotypes 

We then apply our method to a large collection of 17 A. thaliana flowering times 
phenotypes from |Atwell et aL| ( |2010j ) (up to 194 individuals, 214 051 SNPs). The 
groups and networks are again derived from the TAIR protein-protein interac- 
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(b) The true causal SNPs are near the same gene 
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(c) The true causal SNPs are near any of five interacting genes 

Figure 4: Power and false discovery rate (FDR) of SConES, compared to state- 
of-the-art Lasso algorithms and a baseline linear regression, in three different 
data simulation scenarios. Best methods are closest to the upper-left corner. 
Numbers denote the number of SNPs selected by the method. 
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Scenario 


(a) 


(b) 


(c) 


linear regression 


0.45 ± 0.00 


0.54 ±0.00 


0.16 ±0.00 


Lasso 


0.29 ±0.02 


0.27 ±0.01 


0.28 ±0.01 


GS 

ncLasso GM 
GI 


0.29 ±0.02 
0.29 ±0.02 
0.29 ±0.02 


0.30 ±0.01 
0.29 ±0.01 
0.29 ±0.01 


0.33 ± 0.01 

0.29 ±0.01 
0.29 ±0.01 


GS 

groupLasso GM 
GI 


0.26 ±0.01 
0.39 ±0.03 
0.15 ±0.03 


0.26 ±0.02 
0.49 ±0.02 
0.03 ±0.01 


0.29 ±0.01 
0.33 ± 0.01 

0.26 ±0.03 


GS 

graphLasso GM 
GI 


0.26 ±0.01 
0.25 ±0.01 
0.24 ±0.01 


0.26 ±0.02 
0.16 ±0.01 
0.20 ±0.01 


0.29 ±0.01 
0.28 ±0.01 
0.21 ±0.01 


GS 

SConES GM 
GI 


0.55 ± 0.02 
0.56 ± 0.03 

0.47 ±0.02 


0.51 ±0.02 
0.59 ± 0.03 
0.62 ± 0.03 


0.33 ± 0.01 

0.32 ±0.02 
0.34 ± 0.01 



Table 1: F-scores of SConES, compared to state-of-the-art Lasso algorithms 
and a baseline linear regression, in three different data simulation scenarios: (a) 
The true causal SNPs belong to the same genomic segment, (b) The true causal 
SNPs are near the same gene, and (c) The true causal SNPs are near any of five 
interacting genes. Best performance in bold. 



tion data. We filter out SNPs with a minor allele frequency lower than 10%. 
We use the first principal components of the genotypic data as covariates to 



correct for population structure (Price et al. 2006): the number of principal 



components was chosen by adding them one by one until the genomic control 
was close to 1. 

The direct competitors of SConES on this problem are the methods that 
also impose graph constraints on the SNPs they select, namely graphLasso and 
ncLasso. However, they do not scale to datasets such as ours with more than 
200k SNPs (see runtime in Figure [3]) . Hence we had to exclude them from our 
experiments. Instead, we compare SConES to groupLasso, which uses pairs of 
neighboring SNPs, SNPs from the same gene or SNPs from interacting genes 
as pre-defined groups. Note that groupLasso on sequence-neighboring SNPs is 
identical to graphLasso on the sequence network, which is the only instance of 
graphLasso whose computation is practically feasible on this dataset. We run 
Lasso, groupLasso and SConES on the flowering time phenotypes as described 
in Section |3~T1 

To evaluate the quality of the SNPs selected, we perform multivariate linear 
regression on each phenotype in a cross-validation scheme that uses only the 
selected SNPs and report its average Pearson's squared correlation coefficient 
in Figure [5] While the features selected by groupLasso±GS achieve higher 
predictivity than SConES±GS on most phenotypes, the features selected by 
SConES±GM are at least as predictive as those selected by groupLasso+GM in 
two thirds of the phenotypes; the picture is the same for SConES+GI, whose 
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features are on average more predictive than those of groupLasso+GI. 

Next, we checked whether the selected SNPs from the three methods coincide 
with flowering time genes from the literature. We report in Table [2] the number 
of SNPs selected by each of the methods and the proportion of these SNPs that 
are near flowering time candidate genes listed by Segura et al.\ ( |2012 ). Here, 
the picture is reversed: SConES+GS and groupLasso+GI retrieve the highest 
ratio of SNPs near candidate genes, while groupLasso+GS, SConES+GI and 
SConES+GM show lower ratios. At first sight, it seems surprising that the 
methods with highest predictive power retrieve the least SNPs near candidate 
genes. 

To further investigate this phenomenon, we record how many distinct flow- 
ering time candidate genes are retrieved on average by the various methods. A 
gene is considered retrieved if the method selects a SNP near it. Our results 
are shown in Table [3j Methods retrieving a large fraction of SNPs near candi- 
date genes do not necessarily retrieve the largest number of distinct candidate 
genes. Good predictive power, as shown in Figure[5j however, seems to correlate 
with the number of distinct candidate genes selected by an algorithm, not with 
the percentage of selected SNPs near candidate genes. groupLasso+GI has the 
highest fraction of candidate gene SNPs among all methods, but detects only 
three distinct candidate genes. This is probably due to groupLasso selecting 
entire genes or gene pairs; if groupLasso detects a candidate gene, it will pick 
most of the SNPs near that gene, which leads to its high candidate SNP ratio 
in Table [2] 

To summarize, SConES is able to select SNPs that are highly predictive of 
the phenotype. Among all methods, SConES+GM discovers the largest num- 
ber of distinct genes whose involvement in flowering time is supported by the 
literature. 



4 Discussion and Conclusions 

In this article, we defined SConES, a novel approach to multi-locus mapping 
that selects SNPs that tend to be connected in a given biological network with- 
out restricting the search to predefined sets of loci. As the optimization problem 
of SConES can be solved by maximum flow, our solution is computationally effi- 
cient and scales to whole genome data. Our experiments show that our method 
is about two orders of magnitude faster than the state-of-the-art Lasso-based 
comparison partners, and can therefore easily scale to hundreds of thousands 
of SNPs. In simulations, SConES is better at leveraging the structure of the 
biological network to recover causal SNPs. 

On real GWAS data from Arabidopsis thaliana, the predictive ability of the 
features selected by SConES is superior to that of groupLasso on two of the 
three network types we consider. When using more biological information (gene 
membership and gene information), SConES tends to recover more distinct ex- 
planatory genes than groupLasso, which in turns leads to better phenotypic 
prediction. 
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Table 2: Associations detected close to known candidate genes, for all flower- 
ing time phenotypes of Arabidopsis thaliana. We report the number of selected 
SNPs near candidate genes, followed by the total number of selected SNPs. 
Largest ratio in bold. "GS": Genomic sequence network. "GM": Gene mem- 
bership network. "GI" : Gene interaction network. 
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Table 3: Summary statistics, averaged over the Arabidopsis thaliana flowering 
time phenotypes: average total number of selected SNPs ("#SNPs"), average 
proportion of selected SNPs near candidate genes ( "near candidate genes" ) and 
average number of different candidate genes recovered ("candidate genes hit") 
. "GS" : Genomic sequence network. "GM" : Gene membership network. "GI" : 
Gene interaction network. 
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Figure 5: Cross-validated predictivity^fmeasured as squared Pearson correla- 
tion coefficient between actual phenotype and phenotype predicted by a linear 
regression over the selected SNPs) of SConES compared to that of the linear 
regression, Lasso, and groupLasso. 



The constraints imposed by groupLasso and SConES arc different: while the 
groups given to groupLasso and the networks passed to SConES come from the 
same information, the groups force many more SNPs to be selected simultane- 
ously when they may not bring much more information. This gives SConES 
more flexibility, and makes it less vulnerable to ill-defined groups or networks, 
which is especially desirable in the light of the current noisiness and incomplet- 
edness of biological networks. Our results on the genomic sequence network 
actually indicate that graphLasso, using pairs of network edges as groups, may 
achieve the same flexibility as SConES; unfortunately it is too computationally 
demanding to be run on the most informative networks. 

We currently derive the SNP networks from neighborhood along the genome 
sequence, closeness to a same gene, or proximity to interacting proteins. Refin- 
ing those networks and exploring other types of networks as well as understand- 
ing the effects of their topology and density is one of our next projects. 

Let us note that while we do not explicitly consider linkage disequilibrium, 
the Iq sparsity constraint of SConES should enforce that when several correlated 
SNPs are associated with a phenotype, a single one of them is picked. On the 
other hand, if SConES is given a genomic sequence network such as the one we 
describe, the graph smoothness constraint will encourage nearby SNPs to be 
selected together, leading to the selection of sub sequences that are likely to be 
haplotype blocks. Such a network should therefore only be used when the goal 
of the experiment is to detect consecutive sequences of associated SNPs. 

For now SConES considers an additive model between genetic loci. Future 
work includes taking pairwise multiplicative effects into account. Replacing 
the association term in Equation [T] by a sum over pairs of SNPs rather than 
over individual SNPs results in a maximum flow problem over a fully connected 
network of SNPs, which cannot be solved straightforwardly, if only because the 
resulting adjacency matrix is too large to fit in memory on a regular computer. 
It might be possible, however, to leverage some of the techniques used for two- 



locus GWAS ( |Achlioptas et |20Tlj |Kam- Thong et aL||2012[ ) to help solve this 
problem. 

Another important extension of SConES is to devise a way to evaluate the 
statistical significance of the set of selected SNPs. While SConES is speedy 
enough to allow for permutation tests, the effects of parameter selection on 
the computation of the p-value and of the optimization on multiple hypotheses 
testing remain to be evaluated. 

Finally, further exciting research topics include applying SConES to larger 
data sets from human disease consortia, and extending it to the detection of 
shared networks of markers between multiple phenotypes. 



Acknowledgments. 

The authors would like to thank Recep Colak, Barbara Rakitsch and Nino 
Shervashidze for fruitful discussions. 



17 



Funding: C.A. is funded by an Alexander von Humboldt fellowship. 



References 

Achlioptas, P., Scholkopf, B., and Borgwardt, K. (2011). Two-locus association 
mapping in subquadratic time. In Proceedings of the 17th ACM SIGKDD 
international conference on Knowledge discovery and data mining, KDD '11, 
pages 726-734, New York, NY, USA. ACM. 

Ando, R. K. and Zhang, T. (2007). Learning on graph with laplacian regular- 
ization. In Advances in Neural Information Processing Systems 19. 

Atwell, S., Huang, Y. S., Vilhjalmsson, B. J., Willems, G., Horton, M., Li, Y., 
Meng, D., Piatt, A., Tarone, A. M., Hu, T. T., Jiang, R., Muliyati, N. W., 
Zhang, X., Amcr, M. A., Baxter, I., Brachi, B., Chory, J., Dean, C, Debieu, 
M., Mcaux, J. d., Ecker, J. R., Faure, N., Kniskern, J. M., Jones, J. D. G., 
Michael, T., Nemri, A., Roux, F., Salt, D. E., Tang, C, Todesco, M., Traw, 
M. B., Weigel, D., Marjoram, P., Borevitz, J. O., Bergelson, J., and Nordborg, 
M. (2010). Genome- wide association study of 107 phenotypes in arabidopsis 
thaliana inbred lines. Nature, 465(7298), 627-631. 

Boykov, Y. and Kolmogorov, V. (2004). An experimental comparison of min- 
cut/max- flow algorithms for energy minimization in vision. IEEE Trans. 
Pattern Anal. Mach. Intell, 26(9), 1124 -1137. 

Cantor, R. M., Langc, K., and Sinshcimcr, J. S. (2010). Prioritizing GWAS 
results: A review of statistical methods and recommendations for their appli- 
cation. Am. J. Hum. Genet., 86(1), 6-22. 

Cho, S., Kim, K., Kim, Y. J., Lee, J.-K., Cho, Y. S., Lee, J.-Y., Han, B.- 
G., Kim, H., Ott, J., and Park, T. (2010). Joint identification of multiple 
genetic variants via elastic-net variable selection in a genome- wide association 
analysis. Ann. Hum. Genet., 74(5), 416-428. 

Chuang, H.-Y., Lee, E., Liu, Y.-T., Lee, D., and Idckcr, T. (2007). Network- 
based classification of breast cancer metastasis. Mol. Syst. Biol., 3(140). 

Fridley, B. L. and Biernacka, J. M. (2011). Gene set analysis of SNP data: 
benefits, challenges, and future directions. Eur J Hum Genet. 

Goldberg, A. V. and Tarjan, R. E. (1988). A new approach to the maximum-flow 
problem. Journal of the ACM, 35(4), 921-940. 

Gretton, A., Bousquet, O., Smola, A., and Schlkopf, B. (2005). Measuring 
statistical dependence with hilbert-schmidt norms. In Proceedings of the 
16th International Conference on Algorithmic Learning Theory, pages 63-77. 
Springer- Verlag. 



18 



Horton, M. W., Hancock, A. M., Huang, Y. S., Toomajian, C, Atwell, S., 
Auton, A., Muliyati, N. W., Piatt, A., Sperone, F. G., Vilhjlmsson, B. J., 
Nordborg, M., Borevitz, J. O., and Bcrgclson, J. (2012). Genome-wide pat- 
terns of genetic variation in worldwide arabidopsis thaliana accessions from 
the RegMap panel. Nature Genetics, 44(2), 212-216. 

Huang, J., Zhang, T., and Metaxas, D. (2009). Learning with structured spar- 
sity. In Proceedings of the 26th Annual International Conference on Machine 
Learning, ICML '09, page 417424, New York, NY, USA. ACM. 

Jacob, L., Obozinski, G., and Vert, J. -P. (2009). Group lasso with overlap and 
graph lasso. In Proceedings of the 26th Annual International Conference on 
Machine Learning, ICML '09, pages 433-440, New York, NY, USA. ACM. 

Jie, B., Zhang, D., Wee, C.-Y., and Shcn, D. (2012). Structural feature selection 
for connectivity network-based MCI diagnosis. In P.-T. Yap, T. Liu, D. Shen, 
C.-F. Westin, and L. Shen, editors, Multimodal Brain Image Analysis, volume 
7509 of Lecture Notes in Computer Science, pages 175-184. Springer Berlin 
/ Heidelberg. 

Kam-Thong, T., Azencott, C.-A., Cayton, L., Piitz, B., Altmann, A., Karbalai, 
N., Samann, P., Scholkopf, B., Muller-Myhsok, B., and Borgwardt, K. (2012). 
GLIDE: GPU-Based Linear Regression for Detection of Epistasis. Hum Hered, 
73, 220-236. 

Le Saux, B. and Bunke, H. (2005). Feature selection for graph-based image 
classifiers. In J. Marques, N. Perez de la Blanca, and P. Pina, editors, Pattern 
Recognition and Image Analysis, volume 3523 of Lecture Notes in Computer 
Science, pages 147-154. Springer Berlin / Heidelberg. 

Lee, H. F. and Dooly, D. R. (1996). Algorithms for the constrained maximum- 
weight connected graph problem. Naval Research Logistics, 43, 985-1008. 

Li, C. and Li, H. (2008). Network-constrained regularization and variable selec- 
tion for analysis of genomic data. Bioinformatics , 24(9), 1175-1182. 

Liu, J., Ji, S., and Ye, J. (2009). SLEP: Sparse Learning with Efficient Projec- 
tions. Arizona State University. 

Liu, J., Huang, J., Ma, S., and Wang, K. (2012). Incorporating group correla- 
tions in genome- wide association studies using smoothed group lasso. Biostat. 

Mairal, J. and Yu., B. (2011). Path coding penalties for directed acyclic graphs. 
In Proceedings of the 4th NIPS Workshop on Optimization for Machine Learn- 
ing (OPT'll). 

Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., 
Hunter, D. J., McCarthy, M. L, Ramos, E. M., Cardon, L. R., Chakravarti, 
A., Cho, J. H., Guttmacher, A. E., Kong, A., Kruglyak, L., Mardis, E., 
Rotimi, C. N., Slatkin, M., Valle, D., Whittemore, A. S., Boehnkc, M., Clark, 



19 



A. G., Eichler, E. E., Gibson, G., Haines, J. L., Mackay, T. P., McCarroll, 
S. A., and Visscher, P. M. (2009). Finding the missing heritability of complex 
diseases. Nature, 461(7265), 747-753. 

Marchini, J., Donnelly, P., and Cardon, L. R. (2005). Genome-wide strategies 
for detecting multiple loci that influence complex diseases. Nat Genet, 37(4), 
413-417. 

Nacu, §., Critchley-Thorne, R., Lee, P., and Holmes, S. (2007). Gene expression 
network analysis and applications to immunology. Bioinformatics , 23(7), 
850-858. 

Papadimitriou, C. H. and Steiglitz, K. (1982). Combinatorial Optimization: 
Algorithms and Complexity . Prentice-Hall Inc., Englewood Cliffs, NJ. 

Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., 
and Reich, D. (2006). Principal components analysis corrects for stratification 
in genome-wide association studies. Nat Genet, 38(8), 904-909. 

Rakitsch, B., Lippert, C, Stegle, O., and Borgwardt, K. (2012). A lasso multi- 
marker mixed model for association mapping with population structure cor- 
rection. Bioinformatics. 

Segura, V., Vilhjlmsson, B. J., Piatt, A., Korte, A., Seren, ., Long, Q., and 
Nordborg, M. (2012). An efficient multi-locus mixed-model approach for 
genome- wide association studies in structured populations. Nat Genet, 44(7), 
825-830. 

Smola, A. and Kondor, R. (2003). Kernels and regularization on graphs. In 

B. Scholkopf and M. Warmuth, editors, Learning Theory and Kernel Ma- 
chines, volume 2777 of Lecture Notes in Computer Science, pages 144-158. 
Springer Berlin / Heidelberg. 

The Arabidopsis Information Resource (2012). TAIR Protein-Protein 



Interaction. http : //www . arabidopsis . org/portals/proteome/ 

proteinlnteract . j sp . 

Tibshirani, R. (1994). Regression shrinkage and selection via the lasso. J. R. 
Stat. Soc. Series B, 58, 267-288. 

Tsuda, K. (2011). Graph classification methods in chemoinformatics. In H. H.-S. 
Lu, B. Schlkopf, and H. Zhao, editors, Handbook of Statistical Bioinformatics, 
Springer Handbooks of Computational Statistics, pages 335-351. Springer 
Berlin Heidelberg. 

Wang, D., Eskridge, K., and Crossa, J. (2011). Identifying qtls and epista- 
sis in structured plant populations using adaptive mixed lasso. Journal of 
Agricultural, Biological, and Environmental Statistics, 16, 170-184. 



20 



Wu, M. C, Lee, S., Cai, T., Li, Y., Boehnke, M., and Lin, X. (2011). Rare- 
variant association testing for sequencing data with the sequence kernel asso- 
ciation test. Am. J. Hum. Genet., 89(1), 82-93. 

Zuk, O., Hechter, E., Sunyaev, S. R., and Lander, E. S. (2012). The mystery of 
missing heritability: Genetic interactions create phantom heritability. Proc. 
Natl. Acad. Sci. USA, 109(4), 1193-1198. 



21 



